Implement a new --failing-and-slow-first command line argument to test runner. #24624


Open: juj wants to merge 13 commits into main

Conversation

@juj (Collaborator) commented Jun 26, 2025

This keeps track of the results of the previous test run, and on subsequent runs, failing tests are run first, then skipped tests, and last, previously passing tests in slowest-first order. This improves the parallel throughput of the suite.
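A minimal sketch of the ordering idea (the PR persists results to out/__previous_test_run_results.json, but the record layout and the placement of never-seen tests here are assumptions):

def test_sort_key(test_name, previous_results):
  record = previous_results.get(test_name)
  if record is None:
    return (0, 0)  # never-seen tests run early: nothing is known about them
  if record['result'] in ('failed', 'errored'):
    return (0, -record['duration'])  # previously failing tests go first
  if record['result'] == 'skipped':
    return (1, 0)  # then previously skipped tests
  return (2, -record['duration'])  # then passing tests, slowest first

previous = {'test_a': {'result': 'success', 'duration': 2.0},
            'test_b': {'result': 'failed', 'duration': 0.1},
            'test_c': {'result': 'success', 'duration': 30.0},
            'test_d': {'result': 'skipped', 'duration': 0.0}}
tests = ['test_a', 'test_b', 'test_c', 'test_d', 'test_new']
print(sorted(tests, key=lambda t: test_sort_key(t, previous)))
# -> ['test_b', 'test_new', 'test_d', 'test_c', 'test_a']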

Also adds support for --failfast in the multithreaded test suite, so suite runs stop quickly at the first test failure.
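For illustration, one way fail-fast can be layered on a process pool; this is a hedged sketch, not the PR's implementation (run_one and the test names are placeholders):

import multiprocessing

def run_one(test_name):
  # Placeholder for running a single test; returns (name, passed).
  return (test_name, not test_name.startswith('test_bad'))

def run_failfast(tests, jobs=4):
  with multiprocessing.Pool(jobs) as pool:
    # imap_unordered yields results as workers finish, so the parent can
    # stop handing out further work as soon as the first failure arrives.
    for name, passed in pool.imap_unordered(run_one, tests):
      if not passed:
        pool.terminate()  # fail fast: stop the remaining workers
        return name
  return None

if __name__ == '__main__':
  print(run_failfast(['test_a', 'test_bad_b', 'test_c'] * 5))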

Together, the two flags --failfast and --failing-and-slow-first can help achieve < 10 second test suite runs on CI when the suite is failing.

Example core0 runtime with test/runner core0 on a 16-core/32-thread system:

Total core time: 2818.016s. Wallclock time: 118.083s. Parallelization: 23.86x.

Same suite runtime with test/runner --failing-and-slow-first core0:

Total core time: 2940.180s. Wallclock time: 94.027s. Parallelization: 31.27x.

This yields better throughput and a 20.37% reduction in test suite wall time.

juj added 6 commits June 26, 2025 19:33
…t runner. This keeps track of results of previous test run, and on subsequent runs, failing tests are run first, then skipped tests, and last, successful tests in slowest-first order. Add support for --failfast in the multithreaded test suite. This improves parallelism throughput of the suite, and helps stop at test failures quickly.
@sbc100 (Collaborator) left a comment:

IIUC this is what I currently use --failfast --continue for. The downside of --failfast --continue of course is that it doesn't work for parallel testing (so I also add -j1).

@sbc100 (Collaborator) left a comment:

Actually maybe I misunderstood. I use --failfast --continue when implementing new features and wanting to fix each test failure as I run into it.

How does this improve CI times on the bots? It seems like it would not affect the first run, but only subsequent runs, which the bots don't do, do they?

@juj (Collaborator, Author) commented Aug 13, 2025

> How does this improve CI times on the bots? It seems like it would not affect the first run, but only subsequent runs, which the bots don't do, do they?

It doesn't work on the current CircleCI bots, which always start from a clean slate and run all suites from a single command invocation, but it does help when a developer runs test suites locally, and on the ad hoc CI I am running at http://clbri.com:8010/ .

For example, here is one such run:

[image: screenshot of a test suite run on the ad hoc CI]

where all the failing suites fail within a few seconds, rather than taking a random length of time to fail.

Passing suites also run faster, since the shortest tests are run last, meaning core utilization stays at 100% throughout the test suite run. It is like a self-calibrating scheme that avoids having to name slow tests test_zzz_ (a convention that is itself detrimental to test speed).
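As a toy illustration (not code from this PR) of why slowest-first ordering shortens wall time, consider a greedy scheduler that always hands the next test to the least-loaded worker:

import heapq

def wallclock(durations, workers=2):
  # Greedy scheduling: each test goes to the currently least-loaded worker.
  loads = [0.0] * workers
  heapq.heapify(loads)
  for d in durations:
    heapq.heappush(loads, heapq.heappop(loads) + d)
  return max(loads)

tests = [10] + [1] * 10  # one slow test, ten fast ones: 20s of total work
print(wallclock(sorted(tests)))                # slow test last:  15.0s wall time
print(wallclock(sorted(tests, reverse=True)))  # slow test first: 10.0s wall time

With 20s of work on 2 workers the lower bound is 10s of wall time; slowest-first reaches it, while slowest-last leaves one core idle for the final 10 seconds.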

@juj (Collaborator, Author) commented Aug 15, 2025

It would be great to get this landed, since this would enable my CI to run against the upstream tree more easily.

@sbc100 (Collaborator) commented Aug 15, 2025

> It would be great to get this landed, since this would enable my CI to run against the upstream tree more easily.

It seems like there are a couple of different things intertwined here.

Perhaps we can tease some of it apart and try to simplify.

The first thing here is making --failfast work in the parallel runner. That seems like a great idea and maybe we can land it separately.

Regarding running slow tests first, how about just making that the default? For the initial run we could check in a copy of the test times and update it every few months (see the sketch after this comment). Then we wouldn't need a new flag.

Perhaps --failing-first could be a new flag, but again it might make sense to just make it the default?
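A hedged sketch of that checked-in baseline idea (the path test/test_times_baseline.json is a made-up name): prefer the local previous-run results, and fall back to a committed baseline on a fresh checkout:

import json

def load_test_times():
  # Local results from the previous run win; a checked-in baseline
  # (hypothetical path) seeds the ordering on a clean checkout.
  for path in ('out/__previous_test_run_results.json',
               'test/test_times_baseline.json'):
    try:
      with open(path) as f:
        return json.load(f)
    except FileNotFoundError:
      continue
  return {}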

@sbc100 (Collaborator) commented Aug 15, 2025

Why is this change important for your CI?


def addError(self, test, err):
  print(test, '... ERROR', file=sys.stderr)
  self.buffered_result = BufferedTestError(test, err)
  self.test_result = 'errored'
sbc100 (Collaborator):

Is this needed? Isn't the existing buffered_result object enough?

juj (Collaborator, Author):

Python doesn't have an API to ask a test object for its result, and it would have required some kind of awkward isinstance() jungle to convert the result to a string, so I opted to write simple-looking code as the preferable way.
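For contrast, a sketch of what recovering a single test's outcome from a stock unittest.TestResult could look like, by searching its result lists (hypothetical helper, not code from this PR):

import unittest

def result_of(result: unittest.TestResult, test: unittest.TestCase) -> str:
  # Outcomes are scattered across the result object's lists of
  # (test, info) tuples, so each list has to be searched in turn.
  if any(t is test for t, _ in result.errors):
    return 'errored'
  if any(t is test for t, _ in result.failures):
    return 'failed'
  if any(t is test for t, _ in result.skipped):
    return 'skipped'
  return 'passed'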

try:
  previous_test_run_results = json.load(open('out/__previous_test_run_results.json'))
except FileNotFoundError:
  previous_test_run_results = {}
sbc100 (Collaborator):

Do we need to duplicate this code for handling __previous_test_run_results between here and runner.py?

Perhaps a shared sorting function that they can both call?

juj (Collaborator, Author):

This code is not creating a sorter, but I see that the shared block amounts to:

def load_previous_test_run_results():
  try:
    return json.load(open('out/__previous_test_run_results.json'))
  except FileNotFoundError:
    return {}

I could refactor that into e.g. common.py, though that is not necessarily a large win.

sbc100 (Collaborator):

I guess I don't quite understand what is going on here then... I would have thought the new --failing-and-slow-first flag would apply equally to the non-parallel and parallel test runners, and so it would only need to be handled in a single place (where we decide on the test ordering)... I need to take a deeper look at what is really going on here.

juj (Collaborator, Author):

Merged the code.

juj (Collaborator, Author):

The new flag only applies to the parallel test runner. If we wanted to extend this to the non-parallel runner, that's fine, though it could be worked on in a later PR as well.

@juj (Collaborator, Author) commented Aug 15, 2025

> Why is this change important for your CI?

It allows me to run all the test suites in a more reasonable time while iterating on the different failures. Otherwise the failing suites take 30-50x longer to surface their failures, making it time consuming to see what is still failing.
